BACKGROUND OF THE INVENTION

1. Field of the Invention 

This invention relates to the retrieval of information from a database and, more
particularly, to the indexing of information for retrieval from a database in a manner that
compresses the index so as to consume less storage memory.

2. Discussion of the Related Art 

The purpose of an information retrieval (IR) system is to search a database and
return information in response to a query. Hereinafter, the term "documents" will be used
to refer to returned information, though such information need not actually be documents
in the word-processing sense, but rather may be any information, including web pages,
numbers, alphanumerics, etc., or pointers or handles or the like thereto.

Most high-precision IR systems in use today utilize a multi-pass strategy. First,
initial relevance scoring is performed using the original query, and a list of hits is
returned, each with a relevance score. Second, a second scoring pass is made, using
the information found in the high-scoring documents.

Because document databases can be huge, it is desirable to represent the
databases in a way that minimizes media space. Commonly, internal data in a database is
represented by indexes. Note that the indexes for the two relevancy passes described
above are usually different. The first relevancy pass usually uses what is known as an
"inverted index", meaning that a given term is associated with a list of documents
containing the term. In the second index, a given document is associated with a list of
terms appearing in it. The result is that a two-pass system consumes roughly double the
media space of a one-pass system. What is needed is a system that delivers the retrieval
performance of the two-pass system without consuming as much media space.

SUMMARY OF THE INVENTION

Disclosed is a method of indexing a database of documents, comprising providing
a vocabulary of n terms; indexing the database in the form of a non-negative n x m index
matrix V, wherein m is equal to the number of documents in the database, n is equal to
the number of terms used to represent the database, and the value of each element v_ij of
index matrix V is a function of the number of occurrences of the i-th vocabulary term in the
j-th document; and factoring out non-negative matrix factors T and D such that V ≈ TD,
wherein T is an n x r term matrix, D is an r x m document matrix, and r < nm/(n+m).

In another aspect of the invention, the index matrix V is deleted.

In another aspect of the invention, the term matrix T is deleted.

In another aspect of the invention, r is at least one order of magnitude smaller
than n.

In another aspect of the invention, r is from two to three orders of magnitude 
smaller than n. 

In another aspect of the invention, entries of said document matrix D falling 
below a predetermined threshold value t are set to zero. 

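In another aspect of the invention, said factoring out of non-negative matrix
factors T and D further comprises selecting a cost function and associated update rules
from the group:

cost function

$$F = \sum_{i=1}^{n}\sum_{j=1}^{m}\left[V_{ij}\log (TD)_{ij} - (TD)_{ij}\right]$$

associated with update rules

$$T_{ik} \leftarrow T_{ik}\sum_{j=1}^{m}\frac{V_{ij}}{(TD)_{ij}}D_{kj}, \qquad T_{ik} \leftarrow \frac{T_{ik}}{\sum_{l=1}^{n}T_{lk}}, \qquad \text{and} \qquad D_{kj} \leftarrow D_{kj}\sum_{i=1}^{n}T_{ik}\frac{V_{ij}}{(TD)_{ij}};$$

cost function

$$\sum_{i=1}^{n}\sum_{j=1}^{m}\left[V_{ij}\log\frac{V_{ij}}{(TD)_{ij}} - V_{ij} + (TD)_{ij}\right]$$

associated with update rules

$$D_{kj} \leftarrow D_{kj}\,\frac{\sum_{i=1}^{n}T_{ik}V_{ij}/(TD)_{ij}}{\sum_{l=1}^{n}T_{lk}} \qquad \text{and} \qquad T_{ik} \leftarrow T_{ik}\,\frac{\sum_{j=1}^{m}D_{kj}V_{ij}/(TD)_{ij}}{\sum_{l=1}^{m}D_{kl}};$$

and cost function

$$\|V - TD\|^{2} = \sum_{i=1}^{n}\sum_{j=1}^{m}\left(V_{ij} - (TD)_{ij}\right)^{2}$$

associated with update rules

$$D_{kj} \leftarrow D_{kj}\,\frac{(T^{\mathsf T}V)_{kj}}{(T^{\mathsf T}TD)_{kj}} \qquad \text{and} \qquad T_{ik} \leftarrow T_{ik}\,\frac{(VD^{\mathsf T})_{ik}}{(TDD^{\mathsf T})_{ik}};$$

and iteratively calculating said update rules so as to converge said
cost function toward a limit until the distance between V and TD is reduced to or beyond
a desired value.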

Disclosed is a database index, comprising an r x m document matrix D, such that
V ≈ TD, wherein T is an n x r term matrix; V is a non-negative n x m index matrix,
wherein each of its m columns represents a j-th document having n entries containing the
value of a function of the number of occurrences of an i-th term appearing in said j-th
document; and wherein T and D are non-negative matrix factors of V and r < nm/(n+m);
and wherein each of the m columns of said document matrix D corresponds to said j-th
document.

Disclosed is a method of information retrieval, comprising providing a query
comprising a plurality of search terms; providing a vocabulary of n terms; performing a
first pass retrieval through a first database representation and scoring m retrieved
documents according to relevance to said query; executing a second pass retrieval
through a second database representation and scoring documents retrieved from said first
pass retrieval so as to generate a final relevancy score for each document; and wherein
said second database representation comprises an r x m document matrix D, such that
V ≈ TD, wherein T is an n x r term matrix; V is a non-negative n x m index matrix, wherein
each of its m columns represents a j-th document having n entries containing the value of
a function of the number of occurrences of an i-th term of said vocabulary appearing in said
j-th document; and wherein T and D are non-negative matrix factors of V and r <
nm/(n+m); and wherein each of the m columns of said document matrix D corresponds to
said j-th document.

In another aspect of the invention, the final relevancy score for any j-th document is
a function of said j-th document's corresponding entry in said document matrix D and the
corresponding entries in said document matrix D of the K top-scoring documents from
said first pass retrieval.

In another aspect of the invention, the relevancy score function for said j-th
document is proportional to a sum of cosine distances between said j-th document's
corresponding entry in said document matrix D and each of said corresponding entries in
said document matrix D of the K top-scoring documents from said first pass retrieval.

Disclosed are articles of manufacture comprising a computer-usable medium
having computer-readable program means embodied in said medium for executing the
methods disclosed herein.


BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a flow diagram of the overall process of an embodiment of the
invention.

Figure 2a is a diagram of a term matrix. 
Figure 2b is a diagram of a document matrix. 


DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 

Referring to Figure 1, we see a generalized information retrieval process for
retrieving documents from a database such as would be executed by computer-readable
program code means embodied in a computer-usable medium, such as is well known in
the art. A query 100, specifying one or more search terms, is received by the system and
utilized in a first pass retrieval 110 from a first database representation 180. The first
database representation is usually in the form of an "inverted index", meaning an index of
terms wherein each term is associated with a list of every document in the database
containing that term. This permits use of a relevance scoring method, such as, for
example, an Okapi method such as described in S.E. Robertson et al., "Okapi at TREC-3",
Proceedings of the Third Text Retrieval Conference (TREC-3), edited by D.K. Harman,
NIST Special Publication 500-225 (1995), the teachings of which are incorporated by
reference herein in their entirety, though the Okapi method is certainly not the exclusive
means of carrying out this operation. Whatever method is used, document relevance
scores are generated 120 and the system enters a second pass retrieval operation 130.
The second pass retrieval 130 accesses a second database representation 190 to generate
a second relevance score, which may be combined with the first generated scores to
generate a final score 150. In the prior art, the second database representation will be in
the form of an index of documents wherein each document is associated with a list of all
terms in that document. The use of such a second index will usually double the size of
the storage requirements for the indices.
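
By way of illustration only, the following minimal Python sketch builds both prior-art
representations for a toy collection: an inverted index (term to documents) and a
document index (document to terms). The function name build_indices and the sample
documents are hypothetical and are not taken from this disclosure.

```python
from collections import defaultdict

def build_indices(docs):
    """Return (inverted, forward) indices for a list of token lists."""
    inverted = defaultdict(set)   # term -> ids of the documents containing it
    forward = {}                  # document id -> set of terms it contains
    for doc_id, tokens in enumerate(docs):
        forward[doc_id] = set(tokens)
        for term in tokens:
            inverted[term].add(doc_id)
    return dict(inverted), forward

# Hypothetical three-document collection, already tokenized.
docs = [["matrix", "index"], ["index", "query"], ["matrix", "query", "score"]]
inverted, forward = build_indices(docs)

# Each index stores one posting per distinct (term, document) pair.
postings_inverted = sum(len(ids) for ids in inverted.values())
postings_forward = sum(len(terms) for terms in forward.values())
```

Both indices hold the same number of (term, document) postings, which is why keeping
the second index alongside the first roughly doubles the storage requirement.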

As a practical matter, not all terms will generally be listed because to do so would make
the index unwieldy without improving performance. Hence common terms, such as "the",
"a", "to", and the like will be excluded where we are speaking of text documents. To effect
these exclusions, it is common practice to include a vocabulary of searchable terms.
Only those terms listed in the vocabulary will be eligible to be listed in the indices.

Referring to Figures 2a and 2b, the invention utilizes a procedure known as non-negative
matrix factorization (sometimes "positive matrix factorization") to reduce the memory
requirements of the second database representation 190. To do this, an index comprising
an n x m matrix V (not shown) is first created, the m columns of which each correspond
to one of m documents in the database. Each of the n rows corresponds to a term in a
vocabulary (not shown) comprising n terms. Each entry V_ij in the index matrix
corresponds to the term frequency (TF) of an i-th term in a j-th document, that is, a function
of the number of times the i-th term appears in the j-th document. In most cases, the entry
will simply be equal to the number of times the term appears in the document. In one
embodiment, a new matrix V will be generated whenever there is a change to the
vocabulary or any document, or a document is added or deleted.
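
By way of illustration only, the following minimal Python sketch builds an n x m index
matrix of raw term counts from a vocabulary, skipping any term that is not in the
vocabulary. The function name term_frequency_matrix, the vocabulary and the sample
documents are hypothetical.

```python
def term_frequency_matrix(docs, vocabulary):
    """Build the n x m matrix V with V[i][j] = count of vocabulary[i] in docs[j]."""
    row_of = {term: i for i, term in enumerate(vocabulary)}
    V = [[0] * len(docs) for _ in vocabulary]
    for j, tokens in enumerate(docs):
        for token in tokens:
            i = row_of.get(token)
            if i is not None:          # terms outside the vocabulary are skipped
                V[i][j] += 1
    return V

# Hypothetical vocabulary (n = 3) and documents (m = 2); "the", "a", "to", "of"
# are not in the vocabulary and therefore never reach the matrix.
vocabulary = ["index", "matrix", "query"]
docs = [
    "the matrix index of the matrix".split(),
    "a query to the index".split(),
]
V = term_frequency_matrix(docs, vocabulary)
# V == [[1, 1], [2, 0], [0, 1]]
```

Each column of V describes one document and each row one vocabulary term, matching the
layout described above.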

After V is created, a rank of factorization (RF) r is selected, preferably such that r < n and
r < m. The RF is used to factor the n x m V matrix into an n x r term matrix T (Figure
2a) and an r x m document matrix D (Figure 2b), such that:

$$V \approx TD \tag{1}$$

wherein the two matrices T and D have a total of rn + rm entries as compared with the V
matrix's nm entries. So long as

$$r < \frac{nm}{n+m} \tag{2}$$

the total entries of the matrix factors T, D (and therefore the memory requirements) will
always be less than the total entries of the index matrix V, and the two matrix factors T, D
will be a compressed version of the index matrix V. After creation of the matrix factors
T, D, the index matrix V may then be deleted and the storage savings realized. In a
preferred embodiment, the term matrix T may also be deleted for further storage savings.
Note also that, when a new document is added to the database, it is not necessary to
generate a new index matrix V; one may simply update the document matrix D.
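
By way of illustration only, the entry counts behind the condition r < nm/(n+m) can be
checked with a few lines of Python. The function names and the sizes n, m and r below
are hypothetical.

```python
def entry_counts(n, m, r):
    """Return (entries in V, entries in T plus entries in D) for rank r."""
    return n * m, r * (n + m)

def max_compressing_rank(n, m):
    """Largest integer r that satisfies the strict inequality r < nm/(n+m)."""
    bound, remainder = divmod(n * m, n + m)
    return bound - 1 if remainder == 0 else bound

# Hypothetical sizes: 50,000 vocabulary terms, 100,000 documents, rank 200.
n, m, r = 50_000, 100_000, 200
full, factored = entry_counts(n, m, r)
# full == 5_000_000_000 and factored == 30_000_000
```

For these assumed sizes the two factors together hold 30 million entries against 5 billion
for V, and any r up to 33,333 would still satisfy the inequality.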

The greatest storage savings will be realized when r = 1, but this will not be
practicable when n or m is large. This is because Equation 1 is not an equality, but rather
an approximation, which is to say that there is some loss of resolution in the
compression. Hence, there must be a tradeoff between the desire to compress the index
and the desire to avoid loss of data. Generally speaking, however, r can often be chosen
to be about one to about four orders of magnitude smaller than n, preferably about two or
three orders of magnitude smaller. Hence, for a database using tens of thousands to
millions of words, r values of 100 to 500 will generally suffice. Additional storage
savings may be realized by approximating small matrix entries that fall below a
predetermined threshold value t with a zero. Typically, one may find that more than 95%
of the entries in the document matrix D may safely be set to zero without significant loss
of data resolution. These space-saving schemes in combination will typically shrink the
second index by an order of magnitude.
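
By way of illustration only, the thresholding step can be sketched in Python as follows.
Storing only the surviving entries of each column in a dictionary is one possible sparse
layout and is an assumption of this sketch, as are the function names, the sample matrix
and the threshold value.

```python
def sparsify_column(column, t):
    """Keep only entries >= t as {row index: value}; dropped entries read as zero."""
    return {k: value for k, value in enumerate(column) if value >= t}

def zero_fraction(D, t):
    """Fraction of the entries of D that fall below the threshold t."""
    total = sum(len(row) for row in D)
    below = sum(1 for row in D for value in row if value < t)
    return below / total

# Hypothetical 3 x 3 document matrix D (r = 3, m = 3) and threshold t.
D = [[0.90, 0.01, 0.00],
     [0.02, 0.75, 0.03],
     [0.00, 0.04, 0.80]]
t = 0.05
columns = [[row[j] for row in D] for j in range(len(D[0]))]
sparse = [sparsify_column(column, t) for column in columns]
# sparse == [{0: 0.9}, {1: 0.75}, {2: 0.8}]
```

In this toy case six of the nine entries fall below t, so each document column is reduced
to a single stored value.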

Methods for effecting the non-negative matrix factorization include those
described in D.D. Lee et al., "Learning the Parts of Objects by Non-Negative Matrix
Factorization", Nature, Vol. 401, pp. 788-791 (October 1999), the disclosures of which
are incorporated by reference herein in their entirety; or those methods described in D.D.
Lee et al., "Algorithms for Non-Negative Matrix Factorization", Neural Information
Processing Systems (2000), the disclosures of which are incorporated by reference herein
in their entirety; or any other suitable method. A typical method of carrying out the
non-negative matrix factorization is to iteratively execute a set of update rules for T and
D that causes the following function to converge to a local maximum:

$$F = \sum_{i=1}^{n}\sum_{j=1}^{m}\left[V_{ij}\log (TD)_{ij} - (TD)_{ij}\right] \tag{3}$$

The function of Equation 3 represents the probability of generating the V matrix from the
T and D matrices, because the update rules have the effect of adding Poisson noise to the
product (TD)_ij. Equation 3 may also be thought of as a cost function that increases in
value as V approaches TD. The update rules are as follows:

$$T_{ik} \leftarrow T_{ik}\sum_{j=1}^{m}\frac{V_{ij}}{(TD)_{ij}}D_{kj} \tag{4a}$$

$$T_{ik} \leftarrow \frac{T_{ik}}{\sum_{l=1}^{n}T_{lk}} \tag{4b}$$

$$D_{kj} \leftarrow D_{kj}\sum_{i=1}^{n}T_{ik}\frac{V_{ij}}{(TD)_{ij}} \tag{4c}$$


Initial values for the elements of the T and D matrices may be selected by a random
number generator, with the constraint that none of the elements be negative. Starting
from non-negative initial conditions for T and D, iteration of the update rules of
Equations 4 for a non-negative V yields the approximate factorization of Equation 1 by
converging to a local maximum of the objective function of Equation 3. The fidelity of
the approximation enters the updates through the quotient V_ij/(TD)_ij, which approaches
unity with successive iterations. These update rules preserve the non-negativity of T and
D and also constrain the columns of T to sum to unity. By constraining the columns of
the T matrix to sum to unity, one eliminates the degeneracy associated with the
invariance of TD under the transformation

$$T \rightarrow T\Lambda \tag{5a}$$

$$D \rightarrow \Lambda^{-1}D \tag{5b}$$

where Λ is an r x r diagonal matrix.
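
By way of illustration only, the following minimal Python sketch iterates multiplicative
update rules of the kind described above, assuming they take the form given by Lee et al.
(Nature, 1999): a multiplicative update of T, a normalization of the columns of T, and a
multiplicative update of D. The function names, the small constant eps guarding against
division by zero, the sample matrix V and the iteration count are assumptions of this
sketch.

```python
import random

def matmul(A, B):
    """Plain-Python product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def quotient(V, T, D, eps):
    """Element-wise quotient V_ij / (TD)_ij."""
    TD = matmul(T, D)
    return [[v / (td + eps) for v, td in zip(v_row, td_row)]
            for v_row, td_row in zip(V, TD)]

def nmf_step(V, T, D, eps=1e-12):
    """One pass of the assumed update rules; returns the new (T, D)."""
    n, m, r = len(V), len(V[0]), len(D)
    Q = quotient(V, T, D, eps)
    # T_ik <- T_ik * sum_j Q_ij * D_kj
    T = [[T[i][k] * sum(Q[i][j] * D[k][j] for j in range(m)) for k in range(r)]
         for i in range(n)]
    # Normalize every column of T so that it sums to one.
    sums = [sum(T[i][k] for i in range(n)) + eps for k in range(r)]
    T = [[T[i][k] / sums[k] for k in range(r)] for i in range(n)]
    # D_kj <- D_kj * sum_i T_ik * Q_ij, with the quotient recomputed from the new T.
    Q = quotient(V, T, D, eps)
    D = [[D[k][j] * sum(T[i][k] * Q[i][j] for i in range(n)) for j in range(m)]
         for k in range(r)]
    return T, D

random.seed(0)
V = [[3, 0, 1], [0, 2, 0], [1, 0, 4], [0, 5, 1]]   # hypothetical n = 4, m = 3
r = 2
T = [[random.uniform(0.1, 1.0) for _ in range(r)] for _ in V]        # non-negative start
D = [[random.uniform(0.1, 1.0) for _ in V[0]] for _ in range(r)]
for _ in range(200):
    T, D = nmf_step(V, T, D)
```

Because every update only multiplies non-negative numbers, T and D stay non-negative, and
the normalization step leaves each column of T summing to one, so each column of D
carries the overall scale of the corresponding document.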

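Another useful cost function for use with the invention is simply the Euclidean
distance between V and TD:

$$\|V - TD\|^{2} = \sum_{i=1}^{n}\sum_{j=1}^{m}\left(V_{ij} - (TD)_{ij}\right)^{2} \tag{6}$$

which will vanish as V approaches TD and, therefore, will converge to a minimum upon
iteration of the following update rules:

$$D_{kj} \leftarrow D_{kj}\,\frac{(T^{\mathsf T}V)_{kj}}{(T^{\mathsf T}TD)_{kj}} \tag{7a}$$

$$T_{ik} \leftarrow T_{ik}\,\frac{(VD^{\mathsf T})_{ik}}{(TDD^{\mathsf T})_{ik}} \tag{7b}$$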
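
By way of illustration only, the following minimal Python sketch iterates multiplicative
updates for the Euclidean cost function, assuming the form given by Lee et al. (2000), in
which D is scaled element-wise by (TᵀV)/(TᵀTD) and T by (VDᵀ)/(TDDᵀ). The function
names, the constant eps, the sample matrix and the iteration count are assumptions of
this sketch.

```python
import random

def matmul(A, B):
    """Plain-Python product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_distance(V, T, D):
    """Sum over all entries of (V_ij - (TD)_ij) squared."""
    TD = matmul(T, D)
    return sum((v - td) ** 2 for v_row, td_row in zip(V, TD)
               for v, td in zip(v_row, td_row))

def scale(M, num, den, eps):
    """Element-wise M * num / den, with eps guarding against division by zero."""
    return [[x * a / (b + eps) for x, a, b in zip(m_row, n_row, d_row)]
            for m_row, n_row, d_row in zip(M, num, den)]

def euclidean_step(V, T, D, eps=1e-12):
    """One pass: update D with T fixed, then T with the new D fixed."""
    Tt = transpose(T)
    D = scale(D, matmul(Tt, V), matmul(matmul(Tt, T), D), eps)
    Dt = transpose(D)
    T = scale(T, matmul(V, Dt), matmul(T, matmul(D, Dt)), eps)
    return T, D

random.seed(1)
V = [[3, 0, 1], [0, 2, 0], [1, 0, 4], [0, 5, 1]]   # hypothetical n = 4, m = 3
r = 2
T = [[random.uniform(0.1, 1.0) for _ in range(r)] for _ in V]
D = [[random.uniform(0.1, 1.0) for _ in V[0]] for _ in range(r)]
initial = squared_distance(V, T, D)
for _ in range(200):
    T, D = euclidean_step(V, T, D)
final = squared_distance(V, T, D)
```

Comparing initial and final shows how far the iteration has moved TD toward V for the
chosen rank; no column normalization is applied in this variant.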


YOR9-2001-0230 (8728-504) 



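Another useful cost function, similar to Equation 3, is:

$$\sum_{i=1}^{n}\sum_{j=1}^{m}\left[V_{ij}\log\frac{V_{ij}}{(TD)_{ij}} - V_{ij} + (TD)_{ij}\right] \tag{8}$$

but is unlike Equation 3 in that it vanishes as V approaches TD under the following
update rules:

$$D_{kj} \leftarrow D_{kj}\,\frac{\sum_{i=1}^{n}T_{ik}V_{ij}/(TD)_{ij}}{\sum_{l=1}^{n}T_{lk}} \tag{9a}$$

$$T_{ik} \leftarrow T_{ik}\,\frac{\sum_{j=1}^{m}D_{kj}V_{ij}/(TD)_{ij}}{\sum_{l=1}^{m}D_{kl}} \tag{9b}$$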


Whether the cost function used goes to a maximum or a minimum, the
convergence to a limit can be said to be a measure of the distance between V and TD for
the purposes of this disclosure, though technically only the cost function of Equation 6 is
an actual Euclidean distance. Hence, for convenience, we describe the convergence of
the cost function to an upper or lower limit as a minimization of the distance between V
and TD. Iteration of the update rules continues until the distance between V and TD is
reduced to or beyond a desired value.
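
By way of illustration only, the following minimal Python sketch evaluates three cost
functions of the kind discussed above on a toy example, assuming the forms used by Lee
et al.: a likelihood-type function, the sum of V_ij log (TD)_ij - (TD)_ij; the squared
Euclidean distance; and a divergence, the sum of V_ij log(V_ij/(TD)_ij) - V_ij + (TD)_ij.
The argument A stands for the product TD, and the function names and sample values are
hypothetical.

```python
import math

def pairs(V, A):
    """Yield matching entries (V_ij, A_ij) of two equally sized matrices."""
    for v_row, a_row in zip(V, A):
        yield from zip(v_row, a_row)

def likelihood(V, A):
    """Grows as A approaches V; assumes strictly positive entries in A."""
    return sum(v * math.log(a) - a for v, a in pairs(V, A))

def squared_distance(V, A):
    """Vanishes when A equals V."""
    return sum((v - a) ** 2 for v, a in pairs(V, A))

def divergence(V, A):
    """Vanishes when A equals V; assumes strictly positive entries in V and A."""
    return sum(v * math.log(v / a) - v + a for v, a in pairs(V, A))

V = [[2.0, 1.0], [1.0, 3.0]]
exact = [row[:] for row in V]          # a product TD that reproduces V exactly
rough = [[1.5, 1.5], [1.5, 2.5]]       # a hypothetical coarser approximation
```

For strictly positive entries the divergence equals the amount by which the
likelihood-type function falls short of its maximum, which is why one of them rises to a
limit while the other falls to zero as TD approaches V.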

From the term and document matrices, T and D, elements of the index matrix V may
always be approximately recovered for the purposes of executing a second pass, but a
preferred method of executing second pass scoring allows deletion of the term matrix T
for further storage savings. In this method, the Γ top-scoring documents from the first
pass are listed and the information stored. For best performance, the number Γ chosen
will vary according to the size of the database. As a rule of thumb, Γ will be chosen to be
from 1 to 20, more preferably from 2 to 5, for every 20,000 to 25,000 or so documents in
the database.

In the preferred method of the second pass, the score S_j for each j-th document in
the second pass will be a function of its column entry D_j 200 in the document matrix D
and the entries for each γ-th top-scoring document D_γ:

$$S_j = f(D_j, D_{\gamma=1}, D_{\gamma=2}, \ldots, D_{\gamma=\Gamma}) \tag{10}$$

There are various ways to compute the value of S_j, one of which is
cosine-distance based, wherein the score of a document is proportional to the summation
of cosine distances between D_j and the Γ individual vectors D_γ. As can be seen, the
values in the term matrix T are not needed for this method.
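
By way of illustration only, the following minimal Python sketch scores every document
against the top-scoring documents of the first pass using only the columns of D. It
assumes that "cosine distance" denotes the cosine of the angle between two columns, so
that a larger sum means a more relevant document, and it takes the constant of
proportionality to be one. The function names, the sample matrix and the list top_ids
are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def second_pass_scores(D, top_ids):
    """Score each document j by the summed cosine between its column of D
    and the columns of the first-pass top-scoring documents."""
    m = len(D[0])
    columns = [[row[j] for row in D] for j in range(m)]
    return [sum(cosine(columns[j], columns[g]) for g in top_ids) for j in range(m)]

# Hypothetical document matrix with r = 2 rows and m = 4 documents.
D = [[1.0, 0.9, 0.0, 0.1],
     [0.0, 0.1, 1.0, 0.9]]
top_ids = [0, 1]                       # assumed first-pass winners
scores = second_pass_scores(D, top_ids)
```

Documents 0 and 1 score highest because their columns point in nearly the same direction
as the first-pass winners, while documents 2 and 3 score low; the term matrix T is never
consulted.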

It is to be understood that, while the invention has been disclosed with regard to
two-pass systems, this is for illustrative purposes only, and the teachings of this
invention are applicable to systems of any number of passes, any number of which passes
may utilize the non-negative matrix factorization indexing taught herein.

It is to be understood that all physical quantities disclosed herein, unless explicitly
indicated otherwise, are not to be construed as exactly equal to the quantity disclosed, but
rather about equal to the quantity disclosed. Further, the mere absence of a qualifier such
as "about" or the like is not to be construed as an explicit indication that any such
disclosed physical quantity is an exact quantity, irrespective of whether such qualifiers
are used with respect to any other physical quantities disclosed herein.

While preferred embodiments have been shown and described, various
modifications and substitutions may be made thereto without departing from the spirit
and scope of the invention. Accordingly, it is to be understood that the present invention
has been described by way of illustration only, and such illustrations and embodiments as
have been disclosed herein are not to be construed as limiting to the claims.